Background & Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise (WLE) Dataset).

The goal of the project is to predict the manner in which the exercise was carried out. This is the classe variable in the training set, and any other variables may be used to predict it. I will describe how the model was built, including how cross-validation was used, the expected out-of-sample error, and details of the assumptions and decisions made. Finally, the model will be used to predict 20 different test cases.

For this task I will use Decision Trees, Random Forests and Gradient Boosting Machines (GBMs) to fit models to the data after some initial cleaning; these models are chosen because they lend themselves well to classification. I will also carry out a Principal Component Analysis (PCA) to test the impact on accuracy of reducing the features of the data; while this may not work as well for the tree-based methods chosen, I want to learn how to apply the techniques in the course. For completeness, I will also create an ensemble of the two strongest fitted models using a stacking approach, to see if a combination of models gives better results; this is a common phenomenon across data science competitions.

Note: You can use the floating table of contents at the top left of the document to navigate to the relevant sections. This works in Google Chrome.

Libraries Required

You will need the following packages installed:

  • caret
  • rpart
  • randomForest
  • ggplot2
  • rattle
  • rpart.plot
  • corrplot
  • xgboost
  • DiagrammeR

Please refer to session information in the appendix to see packages used at the time of report creation.

Data Cleaning

To keep the write-up of the project concise, a full breakdown of the data cleaning steps is included in the appendix (Appendix: Data). A summary of the steps is included below:

  • Imported data: Train set has 19622 rows and 160 columns; Test set has 20 rows and 160 columns
  • Initial summaries of the data using str() showed that a large number of variables had a high proportion of NAs
  • Removed variables with near-zero variance using the nearZeroVar() function; this step removed 60 variables
  • Removed variables where the proportion of NAs or blanks was over 95% using additional code; this step removed a further 41 variables
  • Inspecting the dataset, I can see a number of variables which are unrelated to classe, being identification or timestamp variables:

  • X
  • user_name
  • raw_timestamp_part_1
  • raw_timestamp_part_2
  • cvtd_timestamp
  • num_window

  • Finally, I set the variable classe as a factor using as.factor() for ease of future computations

The analysis may have benefited from both feature engineering and imputation of missing values. However, due to time constraints and a lack of subject knowledge, I decided that the best course of action was to use the features as they are. We therefore conclude the data cleaning phase with a dataset containing 53 variables.
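The appendix contains the full cleaning code; a minimal sketch of the high-missingness filter described above might look like the following (an assumed illustration, not the appendix code verbatim; df_train is taken to hold the imported training data, and the 95% threshold matches the summary above):

```r
# ASSUMED SKETCH: drop columns where over 95% of values are NA or blank
na_prop <- sapply(df_train, function(x) mean(is.na(x) | x == ""))
df_train <- df_train[, na_prop <= 0.95]
```

The nearZeroVar() and identification-variable steps work in the same spirit: compute a set of column indices to drop, then subset the data frame.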

Approach to Validation & k-Fold Cross-Validation

In order to assess the accuracy of the models created, we will need a validation dataset. This is an important step: since no model will be trained on this data, it gives a good assessment of likely model accuracy on data held out of the model build, which we will use to estimate accuracy on the test set. I have chosen to split the training data provided (and cleaned in the previous step) into 70% training data and 30% validation data. The validation data will be excluded from all model fitting. For this I will use the createDataPartition() function in the caret package.

inTrain <- createDataPartition(df_train$classe, p = 0.7, list = F)
df_train_final <- df_train[inTrain,]; dim(df_train_final)
## [1] 13737    53
df_valid_final <- df_train[-inTrain,]; dim(df_valid_final)
## [1] 5885   53

Finally, on the concept of validation: machine learning techniques often rely on k-fold cross-validation when fitting a model on the training data. As tree-based methods are low bias and high variance, while training our model we break the data into k folds; for each fold, the model is built on the remaining k-1 folds and then tested on the kth fold, and the average of the k recorded errors is the cross-validation error. The model selected is the one with the lowest average cross-validation error. This is made very simple by the caret package, so we'll set it up now and use the same cross-validation approach for all the models we fit.

cv_method <- trainControl(method = "cv", number = 5)

We could experiment with different cross-validation approaches, which might lead to a better model, but due to time constraints I won't test out different approaches in this report.

Aside: Principal Component Analysis (PCA)

Principal Component Analysis is a statistical procedure that converts a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

First we check for correlations in our data, using the cor() function to calculate them and corrplot() to plot them, to see whether there is potential to use PCA on this dataset.

correlations <- cor(df_train_final[,-53]) #REMOVE NON NUMERIC VARIABLES classe
corrplot(correlations, order = "FPC", method = "color", type = "lower", tl.cex = 0.8, tl.col = "grey50")

We can see from this plot that some variables are heavily correlated with others. This suggests there may be some benefit in carrying out principal component analysis to reduce the number of variables. The caret package again makes this very easy via the preProcess() function, though we could also use prcomp() in the stats package. Here the variables are centered and scaled, and preProcess keeps all the components that explain 95% of the variance.

pca <- preProcess(df_train_final[,-53], method = "pca")
#CARET KEEPS THE COMPONENTS THAT EXPLAIN 95% OF THE VARIANCE
df_train_final_pca <- predict(pca, df_train_final[,-53])
df_train_final_pca$classe <- df_train_final$classe
df_valid_final_pca <- predict(pca, df_valid_final[,-53])
df_valid_final_pca$classe <- df_valid_final$classe
df_test_pca <- predict(pca, df_test[,-53])

# OUTPUT PCA 1 & 2
pca$rotation[,1:2]
##                               PC1          PC2
## roll_belt            -0.301726384  0.146053040
## pitch_belt           -0.038366423 -0.290413621
## yaw_belt             -0.188404289  0.262352869
## total_accel_belt     -0.299480812  0.126121475
## gyros_belt_x          0.105712118  0.187047255
## gyros_belt_y         -0.090397747  0.211728149
## gyros_belt_z          0.184683583  0.035613573
## accel_belt_x          0.023766707  0.293390066
## accel_belt_y         -0.315362047  0.054923453
## accel_belt_z          0.311985968 -0.122195446
## magnet_belt_x        -0.002060574  0.290849822
## magnet_belt_y         0.122823971  0.081144803
## magnet_belt_z         0.067472466  0.117707455
## roll_arm              0.054550696 -0.181479280
## pitch_arm             0.036594767  0.068701247
## yaw_arm               0.042974510 -0.119380535
## total_accel_arm       0.111198712 -0.042260527
## gyros_arm_x          -0.005520767  0.051720332
## gyros_arm_y           0.068144563 -0.084268001
## gyros_arm_z          -0.145975336  0.193374637
## accel_arm_x          -0.161076646 -0.100805071
## accel_arm_y           0.260111743 -0.138654826
## accel_arm_z          -0.131451710  0.004038981
## magnet_arm_x         -0.084065781 -0.007403693
## magnet_arm_y          0.058993576  0.024725520
## magnet_arm_z          0.026395409  0.027272424
## roll_dumbbell         0.093504410  0.124797837
## pitch_dumbbell       -0.117517655 -0.143170002
## yaw_dumbbell         -0.138388193 -0.260302886
## total_accel_dumbbell  0.176154265  0.139562813
## gyros_dumbbell_x     -0.007210618 -0.007767173
## gyros_dumbbell_y      0.002770346  0.038056125
## gyros_dumbbell_z      0.003928183  0.003811731
## accel_dumbbell_x     -0.177452800 -0.130438788
## accel_dumbbell_y      0.190830731  0.171467951
## accel_dumbbell_z     -0.166947903 -0.239673440
## magnet_dumbbell_x    -0.179475473 -0.185579194
## magnet_dumbbell_y     0.156387429  0.162547521
## magnet_dumbbell_z     0.170194214 -0.030870590
## roll_forearm          0.061834998 -0.049863044
## pitch_forearm        -0.146424488 -0.097065570
## yaw_forearm           0.110576540 -0.042803361
## total_accel_forearm   0.002690407  0.096696436
## gyros_forearm_x      -0.062940869  0.196689755
## gyros_forearm_y       0.001773031  0.018333283
## gyros_forearm_z       0.007353039  0.021463753
## accel_forearm_x       0.185914146 -0.098848606
## accel_forearm_y       0.038381941  0.090458338
## accel_forearm_z      -0.029141106  0.040862136
## magnet_forearm_x      0.103973218 -0.013893603
## magnet_forearm_y      0.025999126  0.048654460
## magnet_forearm_z     -0.033233470  0.111861241
plot(df_train_final_pca[,1],df_train_final_pca[,2], col = df_train_final$classe)

# CALCULATE CUMULATIVE %
mod_prcomp <- prcomp(df_train_final[,-53], center = T, scale. = T)
#mod_prcomp$rotation[,1:2]
summary(mod_prcomp)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.8856 2.8503 2.17457 2.08178 1.90619 1.73392
## Proportion of Variance 0.1601 0.1562 0.09094 0.08334 0.06988 0.05782
## Cumulative Proportion  0.1601 0.3164 0.40730 0.49064 0.56052 0.61834
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     1.49347 1.43649 1.31834 1.22495 1.17547 1.06300
## Proportion of Variance 0.04289 0.03968 0.03342 0.02886 0.02657 0.02173
## Cumulative Proportion  0.66123 0.70091 0.73434 0.76319 0.78976 0.81149
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.99698 0.93321 0.90014 0.88453 0.82619 0.75172
## Proportion of Variance 0.01911 0.01675 0.01558 0.01505 0.01313 0.01087
## Cumulative Proportion  0.83061 0.84736 0.86294 0.87798 0.89111 0.90198
##                           PC19    PC20    PC21    PC22    PC23    PC24
## Standard deviation     0.71704 0.69235 0.64343 0.62378 0.60573 0.57287
## Proportion of Variance 0.00989 0.00922 0.00796 0.00748 0.00706 0.00631
## Cumulative Proportion  0.91187 0.92108 0.92905 0.93653 0.94358 0.94990
##                          PC25    PC26    PC27   PC28    PC29    PC30
## Standard deviation     0.5491 0.53760 0.50316 0.4838 0.44651 0.41519
## Proportion of Variance 0.0058 0.00556 0.00487 0.0045 0.00383 0.00331
## Cumulative Proportion  0.9557 0.96125 0.96612 0.9706 0.97445 0.97777
##                           PC31   PC32    PC33    PC34    PC35    PC36
## Standard deviation     0.38083 0.3609 0.34735 0.33509 0.30446 0.28455
## Proportion of Variance 0.00279 0.0025 0.00232 0.00216 0.00178 0.00156
## Cumulative Proportion  0.98056 0.9831 0.98538 0.98754 0.98932 0.99088
##                           PC37    PC38    PC39    PC40   PC41    PC42
## Standard deviation     0.25454 0.23814 0.23517 0.20303 0.1914 0.18288
## Proportion of Variance 0.00125 0.00109 0.00106 0.00079 0.0007 0.00064
## Cumulative Proportion  0.99213 0.99322 0.99428 0.99507 0.9958 0.99642
##                           PC43    PC44    PC45    PC46    PC47    PC48
## Standard deviation     0.17984 0.16643 0.16442 0.16407 0.14733 0.14248
## Proportion of Variance 0.00062 0.00053 0.00052 0.00052 0.00042 0.00039
## Cumulative Proportion  0.99704 0.99758 0.99810 0.99861 0.99903 0.99942
##                           PC49    PC50    PC51    PC52
## Standard deviation     0.11267 0.09618 0.07717 0.04638
## Proportion of Variance 0.00024 0.00018 0.00011 0.00004
## Cumulative Proportion  0.99967 0.99984 0.99996 1.00000
# CALCULATE CUMULATIVE VARIANCE EXPLAINED FOR PLOTTING
vars <- apply(mod_prcomp$x, 2, var)
props <- cumsum(vars/sum(vars))
pca_plot <- as.data.frame(props)
pca_plot$PCA_Num <- seq.int(nrow(pca_plot))
names(pca_plot)[names(pca_plot) == "props"] <- "Cumulative_Variance"

# CREATE PLOT
g <- ggplot(data = pca_plot, aes(x=PCA_Num, y=Cumulative_Variance)) 
g + geom_line() +
  geom_point() +
  ggtitle("Principal Component Analysis", 
          subtitle = "Cumulative Variance Explained up to and including Principal Component") +
  ylab("Variance Explained") +
  xlab("Principal Component #") +
  geom_hline(yintercept = 0.95, colour = "red", size = 1, linetype = "dashed") +
  geom_text(aes(label = "95% Variance Threshold"), 
            size = 4, x = 10, y = 0.967, colour = "red" ) 

From this we can see that:

  • 25 principal components explain 95% of the variance in the model; see the graph above
  • the first two principal components are output above to show how each component is constructed from the original variables
  • the plot of cumulative variance explained shows that the first principal component explains the largest proportion of variance and each subsequent component explains less than its predecessor, with the first 25 components crossing the 95% threshold
  • although PCA leaves us with fewer variables to model, we lose transparency: as the output showing the construction of the first two components makes clear, each principal component is a linear combination of the variables in the model, which is harder to interpret than the original measurements

Summary Pre Modelling

After our data cleaning, splitting and pre processing has been completed we have 6 datasets we will use in the remainder of the analysis:

  • df_train_final : 70% of the original training data, with irrelevant variables, largely missing variables and near-zero-variance variables removed. Contains 53 variables and 13,737 rows. Will be used to train models
  • df_valid_final : 30% of the original training data, with the same variables removed. Contains 53 variables and 5,885 rows. Will be used to assess the expected accuracy of trained models before we apply them to the test dataset
  • df_train_final_pca : 70% of the original training data, with the same variables removed, principal component analysis applied to the remaining variables, and the classe variable appended. Contains 26 variables and 13,737 rows. Will be used to train models
  • df_valid_final_pca : 30% of the original training data, with the same transformations as df_train_final_pca. Contains 26 variables and 5,885 rows. Will be used to assess the expected accuracy of trained models before we apply them to the test dataset, and to compare whether PCA adds value to this exercise
  • df_test : the original test data with the same transformations as applied to df_train_final. 53 variables and 20 observations. Will be used to score final models and submit the quiz
  • df_test_pca : the original test data with the same transformations as applied to df_train_final_pca. 26 variables and 20 observations. Will be used to score final models and submit the quiz if PCA proves more successful than using the raw data features

We will fit 3 types of models to these datasets and an ensemble model of the best two. The following methods will be used:

  • Decision Tree: the most basic approach used in this project; produces a single tree using the rpart and caret packages. These models are very simple to understand and visualise, but with such low complexity we expect them to have the lowest accuracy
  • Random Forests: a more complex approach that builds a large number of decision trees on bootstrap samples and aggregates their predictions by majority vote, produced using the rf method in caret. With so many trees the model is more difficult to interpret, but we expect it to have higher accuracy on our validation data than a single decision tree
  • Gradient Boosting Machines: a further variation on the decision tree concept in which predictors (typically decision trees) are fitted iteratively, each one fitted to the residuals of those before it, until a stopping criterion is met. This is even less interpretable than Random Forests, but we expect higher accuracy
  • Ensemble: I will take the best two models created and combine them using a process called stacking, so that the final model is a blend of the two. Winning submissions in machine learning competitions on Kaggle typically use ensemble methods, and as stacking is touched on in the course I want to test the approach on a real example

Building The Models

Before we start modelling it is important to set a seed so that results are reproducible. While I could initialise this once for the project, I prefer to reset the seed before every new model fit, so you will frequently see the line set.seed(2808); the choice of 2808 as the seed is arbitrary.

Throughout this section I use Accuracy as the metric to assess predictive power of my models.

1. Decision Tree

For the decision tree models we will use the train() function in caret, which uses the rpart package. This function applies our chosen cross-validation method, discussed above, to reduce potential overfitting.

# Using variables as provided
set.seed(2808)
model_tree <- train(classe ~ ., data = df_train_final, method = "rpart", tuneLength = 50, metric = "Accuracy", trControl = cv_method)
# Using PCA
set.seed(2808)
model_tree_pca <- train(classe ~ ., data = df_train_final_pca, method = "rpart", tuneLength = 50, metric = "Accuracy", trControl = cv_method)

Now we use the predict() function to score these models on our validation data. We can then build a confusion matrix of the actual classe outcomes in the validation dataset against the predicted classe using the confusionMatrix() function, and use it to assess the accuracy of our models.

# Non PCA Tree
predict_tree <- predict(model_tree, newdata = df_valid_final)
CM_tree <- confusionMatrix(predict_tree, df_valid_final$classe)
CM_tree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1617   64   10    9   14
##          B   15  962   42   16   37
##          C   11   55  938   41   20
##          D   13   33   27  879   28
##          E   18   25    9   19  983
## 
## Overall Statistics
##                                           
##                Accuracy : 0.914           
##                  95% CI : (0.9066, 0.9211)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8912          
##  Mcnemar's Test P-Value : 2.245e-07       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9659   0.8446   0.9142   0.9118   0.9085
## Specificity            0.9770   0.9768   0.9739   0.9795   0.9852
## Pos Pred Value         0.9434   0.8974   0.8808   0.8969   0.9326
## Neg Pred Value         0.9863   0.9632   0.9817   0.9827   0.9795
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2748   0.1635   0.1594   0.1494   0.1670
## Detection Prevalence   0.2912   0.1822   0.1810   0.1665   0.1791
## Balanced Accuracy      0.9715   0.9107   0.9440   0.9457   0.9469
# PCA Tree
predict_tree_pca <- predict(model_tree_pca, newdata = df_valid_final_pca)
CM_tree_pca <- confusionMatrix(predict_tree_pca, df_valid_final_pca$classe)
CM_tree_pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1336  184   73   72   54
##          B  112  669  113   72  156
##          C  135  137  778  163   90
##          D   64   72   14  601   57
##          E   27   77   48   56  725
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6982          
##                  95% CI : (0.6863, 0.7099)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6179          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7981   0.5874   0.7583   0.6234   0.6701
## Specificity            0.9090   0.9046   0.8920   0.9579   0.9567
## Pos Pred Value         0.7772   0.5963   0.5971   0.7438   0.7771
## Neg Pred Value         0.9189   0.9013   0.9459   0.9285   0.9279
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2270   0.1137   0.1322   0.1021   0.1232
## Detection Prevalence   0.2921   0.1907   0.2214   0.1373   0.1585
## Balanced Accuracy      0.8536   0.7460   0.8251   0.7907   0.8134

We can see from the output above that the tree fitted on the variables as provided has an accuracy of 91.4%, while the model fitted on the PCA-reduced data has an accuracy of 69.82%.

This shows the expected drop in predictive power when PCA is used to reduce the number of features. An accuracy of 91.4% is also not a bad starting point for such a simple model.

2. Random Forest

Extending a tree to a forest, we expect a jump in predictive power.

# NON PCA
set.seed(2808)
model_rf <- train(classe ~., data = df_train_final, method = "rf", trControl = cv_method)

# PCA
set.seed(2808)
model_rf_pca <- train(classe ~., data = df_train_final_pca, method = "rf", trControl = cv_method)

Again, we predict these on the validation dataset and assess accuracy.

#PREDICT MODELS ONTO VALIDATION DATA
predict_rf <- predict(model_rf, newdata = df_valid_final)
predict_rf_pca <- predict(model_rf_pca, newdata = df_valid_final_pca)

#NON PCA confusion Matrix
CM_rf <- confusionMatrix(predict_rf, df_valid_final$classe)
CM_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    2    0    0    0
##          B    1 1136    9    0    0
##          C    0    1 1016   12    1
##          D    0    0    1  951    4
##          E    0    0    0    1 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9923, 0.9963)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9931          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9974   0.9903   0.9865   0.9954
## Specificity            0.9995   0.9979   0.9971   0.9990   0.9998
## Pos Pred Value         0.9988   0.9913   0.9864   0.9948   0.9991
## Neg Pred Value         0.9998   0.9994   0.9979   0.9974   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1930   0.1726   0.1616   0.1830
## Detection Prevalence   0.2846   0.1947   0.1750   0.1624   0.1832
## Balanced Accuracy      0.9995   0.9976   0.9937   0.9927   0.9976
#PCA confusion Matrix
CM_rf_pca <- confusionMatrix(predict_rf_pca, df_valid_final_pca$classe)
CM_rf_pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1667   18    3    1    0
##          B    3 1103   18    1    4
##          C    1   15  997   29    8
##          D    2    0    7  931    9
##          E    1    3    1    2 1061
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9786          
##                  95% CI : (0.9746, 0.9821)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9729          
##  Mcnemar's Test P-Value : 4.101e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9958   0.9684   0.9717   0.9658   0.9806
## Specificity            0.9948   0.9945   0.9891   0.9963   0.9985
## Pos Pred Value         0.9870   0.9770   0.9495   0.9810   0.9934
## Neg Pred Value         0.9983   0.9924   0.9940   0.9933   0.9956
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2833   0.1874   0.1694   0.1582   0.1803
## Detection Prevalence   0.2870   0.1918   0.1784   0.1613   0.1815
## Balanced Accuracy      0.9953   0.9815   0.9804   0.9811   0.9896

The accuracy here is much better, as we'd expect: the model fitted on the variables as provided has an accuracy of 99.46%, while the model fitted on the PCA-reduced data has an accuracy of 97.86%.

We can see that the PCA-based model has closed the gap considerably on the model built on the variables as provided, although it still lags behind.

3. Gradient Boosting Machines (GBM)

With the accuracy of the Random Forest model so high, we could simply have used it to predict the classes of our 20 test cases; however, for completeness and to better understand the concepts of the course, I'll continue as planned. Again, we fit the models.

# NON PCA MODEL
set.seed(2808)
model_gbm <- train(classe ~., data = df_train_final, method = "xgbTree", trControl = cv_method)
# PCA MODEL
set.seed(2808)
model_gbm_pca <- train(classe ~., data = df_train_final_pca, method = "xgbTree", trControl = cv_method)

Using these models to predict the outcome of the validation data and assessing accuracy of predictions vs actual values.

#PREDICT VALIDATION DATA OUTCOMES
predict_gbm <- predict(model_gbm, newdata = df_valid_final)
predict_gbm_pca <- predict(model_gbm_pca, newdata = df_valid_final_pca)

#ACCURACY METRICS - NON PCA
CM_gbm <- confusionMatrix(predict_gbm, df_valid_final$classe)
CM_gbm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    2    0    0    0
##          B    1 1137    2    3    0
##          C    0    0 1020   11    1
##          D    0    0    4  949    2
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9954         
##                  95% CI : (0.9933, 0.997)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9942         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9982   0.9942   0.9844   0.9972
## Specificity            0.9995   0.9987   0.9975   0.9988   0.9998
## Pos Pred Value         0.9988   0.9948   0.9884   0.9937   0.9991
## Neg Pred Value         0.9998   0.9996   0.9988   0.9970   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1932   0.1733   0.1613   0.1833
## Detection Prevalence   0.2846   0.1942   0.1754   0.1623   0.1835
## Balanced Accuracy      0.9995   0.9985   0.9958   0.9916   0.9985
#ACCURACY METRICS - PCA
CM_gbm_pca <- confusionMatrix(predict_gbm_pca, df_valid_final_pca$classe)
CM_gbm_pca
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1631   45    6    6    6
##          B   14 1041   35    6   23
##          C   13   35  951   51   19
##          D   12    5   28  893   19
##          E    4   13    6    8 1015
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9398          
##                  95% CI : (0.9335, 0.9458)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9239          
##  Mcnemar's Test P-Value : 7.276e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9743   0.9140   0.9269   0.9263   0.9381
## Specificity            0.9850   0.9836   0.9757   0.9870   0.9935
## Pos Pred Value         0.9628   0.9303   0.8896   0.9331   0.9704
## Neg Pred Value         0.9897   0.9794   0.9844   0.9856   0.9862
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2771   0.1769   0.1616   0.1517   0.1725
## Detection Prevalence   0.2879   0.1901   0.1816   0.1626   0.1777
## Balanced Accuracy      0.9797   0.9488   0.9513   0.9567   0.9658

The accuracy here is slightly better than the Random Forest models: the model fitted on the variables as provided has an accuracy of 99.54%, while the model fitted on the PCA-reduced data has an accuracy of 93.98%.

4. Ensembling

While either the Random Forest or GBM model built on the actual variables (rather than the reduced variable set under PCA) would be sufficient for the test dataset given accuracies of over 99%, we will attempt to stack the models. We don't really expect much of an improvement in accuracy; again, I'm just using this as an opportunity to apply some of the theory from the course.

The steps of fitting a stacked model are as follows:

  • Create a data frame which includes the variable we are trying to predict and the predictions, on the training data, of the models we want to ensemble
  • Fit a new machine learning model using the formula target variable ~ model_1 + ... + model_n (note that using in-sample predictions here risks overfitting; out-of-fold predictions would be more robust)
  • Predict this model onto the validation data and use a confusion matrix to assess likely accuracy
  • Decide whether to use the ensemble as the final model for the test data, or one of the models on its own
#GET PREDICTIONS FROM GBM MODEL FOR TRAINING DATASET
predict_gbm_stack <- predict(model_gbm, newdata = df_train_final)
#GET PREDICTIONS FROM RANDOM FOREST MODEL FOR TRAINING DATASET
predict_rf_stack <- predict(model_rf, newdata = df_train_final)

#CREATE A DATAFRAME TO COLLECT MODEL PREDICTIONS AND CLASSE FACTOR TOGETHER
df_stacked <- data.frame(predict_gbm = predict_gbm_stack, predict_rf = predict_rf_stack, classe = df_train_final$classe)

#NOW FIT MODEL
set.seed(2808)
model_stacked <- train(classe ~ predict_gbm + predict_rf, data = df_stacked, trControl = cv_method, method = "xgbTree")

Now predict onto the validation data and assess accuracy.

#BUILD THE SAME STACKED PREDICTORS FROM THE VALIDATION-SET PREDICTIONS MADE EARLIER
df_stacked_valid <- data.frame(predict_gbm = predict_gbm, predict_rf = predict_rf)
#PREDICT ONTO VALIDATION DATA
predict_stacked <- predict(model_stacked, newdata = df_stacked_valid)
#TEST ACCURACY OF STACKED MODEL
CM_stacked <- confusionMatrix(predict_stacked, df_valid_final$classe)
CM_stacked
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    2    0    0    0
##          B    1 1137    2    3    0
##          C    0    0 1020   11    1
##          D    0    0    4  949    2
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9954         
##                  95% CI : (0.9933, 0.997)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9942         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9982   0.9942   0.9844   0.9972
## Specificity            0.9995   0.9987   0.9975   0.9988   0.9998
## Pos Pred Value         0.9988   0.9948   0.9884   0.9937   0.9991
## Neg Pred Value         0.9998   0.9996   0.9988   0.9970   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1932   0.1733   0.1613   0.1833
## Detection Prevalence   0.2846   0.1942   0.1754   0.1623   0.1835
## Balanced Accuracy      0.9995   0.9985   0.9958   0.9916   0.9985

We can see that the stacked model has an accuracy of 99.54%, the same as the GBM model (99.54%) and slightly above the Random Forest model (99.46%).

In Kaggle competitions the winning entry is typically a stacked model, so we would expect stacked models to outperform models from single methods. Why is this not the case here?

  • The accuracy of both models pre-ensemble is over 99%, so any additional model will struggle to improve on them
  • We are using a validation set of 30% of the original training data to assess accuracy; this may be subject to noise and lead us to conclude the GBM model is superior, and on a larger set of data we’d have more confidence in our result
  • Finally, the GBM model may actually be superior to both the stacked and Random Forest models

To investigate this last point quickly, we can compare the actual classe values in the validation data against the predictions of the GBM and Random Forest models: where the predictions differ, is it clear that one model is superior to the other?

First we create a dataset containing the predictions of the two models on the validation data and the actual classe variable, then subset it to the rows where the GBM model’s prediction differs from the Random Forest model’s.

# LETS SEE WHERE PREDICTIONS DIFFER BETWEEN GBM AND RF MODEL...
df_valid_stack <- data.frame(predict_gbm = predict_gbm, predict_rf = predict_rf, classe = df_valid_final$classe)
df_stack_mismatch <- df_valid_stack[df_valid_stack$predict_gbm != df_valid_stack$predict_rf,]

There are only 39 instances out of the 5885 rows in the validation dataset where this mismatch occurs, so less than 1%. Let’s look at the first few rows of this data, then calculate the percentage of mismatches which are correctly predicted by the GBM model.

head(df_stack_mismatch)
##      predict_gbm predict_rf classe
## 638            A          B      A
## 1174           B          A      A
## 1837           A          B      B
## 2398           B          A      B
## 2404           A          B      B
## 2590           B          A      B
sum(df_stack_mismatch$predict_gbm == df_stack_mismatch$classe) / dim(df_stack_mismatch)[1]
## [1] 0.5641026

This shows us that 56.4% of the mismatches (22 of 39) are better predicted by the GBM model. This is consistent with the GBM model being slightly superior to the Random Forest model, as we saw earlier, but it also means it is difficult for the stacked model to improve on the expected GBM accuracy.
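
With only 39 mismatches, it is worth asking how much weight a 22-out-of-39 split can carry. As a quick sanity check (a sketch using base R’s binom.test(), not part of the original analysis), we can test whether this split differs significantly from a 50/50 coin flip between the two models:

```r
# 22 of the 39 mismatches were predicted correctly by the GBM model.
# Null hypothesis: when the models disagree, each is equally likely
# to be the correct one (p = 0.5).
bt <- binom.test(x = 22, n = 39, p = 0.5)
bt$p.value  # well above 0.05, so not significant at this sample size
```

This is consistent with the caveat above that the validation set may be too noisy to declare the GBM model definitively superior on the mismatches alone.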

Summary: Choosing The Final Model

As we have fitted a number of models above, we pull all the results into one data frame and plot them to summarise.

Accuracy <- c(CM_tree$overall[1], CM_tree_pca$overall[1], CM_rf$overall[1], CM_rf_pca$overall[1], CM_gbm$overall[1], CM_gbm_pca$overall[1], CM_stacked$overall[1] )
PCA_Model <- c("N", "Y", "N", "Y","N", "Y", "N")
X <- c(1,1,2,2,3,3,4)
Model_Name <- c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "GBM", "GBM", "Ensemble") 

df_results <- data.frame(Model_Name, PCA_Model, Accuracy, X)
library(ggrepel)
ggplot(data = df_results, aes(x=X, y = Accuracy, color = PCA_Model)) +
  geom_point() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank()) +
  ggtitle("Model Accuracy", 
          subtitle = "Expected Accuracy based on Validation Data") +
  ylab("Accuracy") +
  geom_label_repel(aes(label= paste0(Model_Name,";", round(Accuracy,4))), 
                       arrow = arrow(length = unit(0.03, "npc"), type = "closed", ends = "first"),
                       force = 10,
                       box.padding = 0.35, point.padding = 0.5, label.size = 0.1, show.legend = F) +
  ylim(0.6,1.05)

This shows:

  • For every model type, the model fitted on the data where the number of variables was reduced using Principal Component Analysis is worse than the corresponding model without PCA (i.e. using the raw features) - so reject the PCA candidate models
  • The GBM model outperforms the Random Forest model; however, both have accuracy over 99%, so we’d expect both to do well when scoring our test data - reject the Random Forest model
  • The expected accuracy / error rate is the same for the GBM and ensemble models, but the GBM model is simpler - reject the ensemble model

We will therefore select the model fitted using the GBM approach to predict the outcome of the test data; we expect the accuracy of the prediction to be 99.54%.
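
As a back-of-the-envelope check on what 99.54% accuracy means for the 20-case test set (a sketch assuming the per-case errors are independent, which the original analysis does not claim), the probability of getting all 20 predictions correct is:

```r
# Probability that all 20 test cases are predicted correctly,
# assuming independent errors at a per-case accuracy of 99.54%
p_all_correct <- 0.9954^20
round(p_all_correct, 2)  # roughly 0.91
```

So even with a very accurate model, there is close to a 1-in-10 chance of at least one wrong answer across the 20 test cases.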

Now that we’ve selected our model, let’s find out a little more about it. Using the varImp() function in the caret package we can see which features have the most influence in the model. For a single decision tree, importance is calculated as the amount each attribute’s split point improves the performance measure, weighted by the number of observations the node is responsible for. The feature importances are then averaged across all of the weak learners in the model. Let’s plot:

importance <- varImp(model_gbm, scale = F)
  plot(importance, main = "GBM Model: Feature Importance")

This shows us the following:

  • The most important features in the model are the variables yaw_belt and roll_belt
  • A large number of features are used in the model - highlighting the issue with transparency of Machine Learning models
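
To make the importance calculation described above concrete, here is a toy sketch with invented numbers (not the caret implementation): each split contributes its improvement in the performance measure, weighted by the share of observations reaching that node, and the per-feature totals are averaged across trees:

```r
# Each row is one split in one of two small trees: the feature used,
# the improvement it produced, and the fraction of observations
# reaching that node (the weight). All values are made up.
splits <- data.frame(
  tree        = c(1, 1, 2, 2),
  feature     = c("roll_belt", "yaw_belt", "roll_belt", "yaw_belt"),
  improvement = c(0.40, 0.30, 0.20, 0.10),
  weight      = c(1.0, 0.5, 1.0, 0.5)
)
splits$contrib <- splits$improvement * splits$weight
# Sum the weighted improvements per feature, then average over the 2 trees
importance <- tapply(splits$contrib, splits$feature, sum) / 2
importance
# roll_belt: (0.40*1.0 + 0.20*1.0)/2 = 0.30
# yaw_belt:  (0.30*0.5 + 0.10*0.5)/2 = 0.10
```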

Let’s look at the first 2 trees fitted in the GBM model using the xgb.plot.tree() function in the xgboost package:

xgb.plot.tree(model = model_gbm$finalModel, trees = 1:2)

This shows the first 2 trees fitted in the GBM model, which gives us some insight into its complexity; however, it only scratches the surface, as 749 trees in total were fitted to the data.

Predicting Onto Test Data

The final part of the project involves predicting the outcome of the 20 test cases in the test dataset. As discussed in the summary section, we will use the GBM model fitted on the variables as provided (i.e. not using PCA).

predict_test <- data.frame(problem_id = df_test$problem_id, predicted_classe = predict(model_gbm, newdata = df_test))
predict_test
##    problem_id predicted_classe
## 1           1                B
## 2           2                A
## 3           3                B
## 4           4                A
## 5           5                A
## 6           6                E
## 7           7                D
## 8           8                B
## 9           9                A
## 10         10                A
## 11         11                B
## 12         12                C
## 13         13                B
## 14         14                A
## 15         15                E
## 16         16                E
## 17         17                A
## 18         18                B
## 19         19                B
## 20         20                B

Appendix

Data Cleaning

First we’ll import the data and run some simple summaries using the dim() and str() functions.

df_train <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"),header=TRUE)
df_test <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"),header=TRUE)
dim(df_train) #19622 Entries; 160 variables
## [1] 19622   160
dim(df_test)  #20 Entries; 160 variables
## [1]  20 160
str(df_train)
## 'data.frame':    19622 obs. of  160 variables:
##  $ X                       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ user_name               : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084231 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 808298 820366 120339 196328 304277 368296 440390 484323 484434 ...
##  $ cvtd_timestamp          : Factor w/ 20 levels "02/12/2011 13:32",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ new_window              : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ num_window              : int  11 11 11 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt              : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ kurtosis_roll_belt      : Factor w/ 397 levels "","-0.016850",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_belt     : Factor w/ 317 levels "","-0.021887",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt      : Factor w/ 395 levels "","-0.003095",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_belt.1    : Factor w/ 338 levels "","-0.005928",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_belt       : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_belt          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_belt            : Factor w/ 68 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_belt     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_belt    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_belt      : Factor w/ 4 levels "","#DIV/0!","0.00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ var_total_accel_belt    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_belt        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_belt           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_belt       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_belt          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_belt         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_belt            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_belt_x            : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y            : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z            : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x            : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y            : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z            : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x           : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y           : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z           : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm                : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm               : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm                 : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm         : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ var_accel_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_roll_arm         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_pitch_arm        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ avg_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stddev_yaw_arm          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ var_yaw_arm             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gyros_arm_x             : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y             : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z             : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x             : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y             : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z             : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x            : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y            : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z            : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ kurtosis_roll_arm       : Factor w/ 330 levels "","-0.02438",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_arm      : Factor w/ 328 levels "","-0.00484",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_arm        : Factor w/ 395 levels "","-0.01548",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_arm       : Factor w/ 331 levels "","-0.00051",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_arm      : Factor w/ 328 levels "","-0.00184",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_arm        : Factor w/ 395 levels "","-0.00311",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_roll_arm            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_arm           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_arm             : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_roll_arm      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_pitch_arm     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ amplitude_yaw_arm       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ roll_dumbbell           : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell          : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell            : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ kurtosis_roll_dumbbell  : Factor w/ 398 levels "","-0.0035","-0.0073",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_picth_dumbbell : Factor w/ 401 levels "","-0.0163","-0.0233",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ kurtosis_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_roll_dumbbell  : Factor w/ 401 levels "","-0.0082","-0.0096",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_pitch_dumbbell : Factor w/ 402 levels "","-0.0053","-0.0084",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ skewness_yaw_dumbbell   : Factor w/ 2 levels "","#DIV/0!": 1 1 1 1 1 1 1 1 1 1 ...
##  $ max_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_picth_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ max_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ min_roll_dumbbell       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_pitch_dumbbell      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ min_yaw_dumbbell        : Factor w/ 73 levels "","-0.1","-0.2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ amplitude_roll_dumbbell : num  NA NA NA NA NA NA NA NA NA NA ...
##   [list output truncated]

Here we can see that a large number of columns have NA values, so in order to quickly clean the data and reduce the number of variables we are working with, we will remove the variables whose proportion of NAs exceeds 95%. Note that if we were carrying out a full analysis with fewer time constraints, we might investigate whether any of these variables, despite their limited data, could improve the model. But first we will use the very handy nearZeroVar() function to remove variables with little or no variation in their response, which will not be useful in predicting the outcome.

# CLEAN PART 1: USE nearZeroVar TO REDUCE NUMBER OF VARIABLES
remove_nzv <- nearZeroVar(df_train)
length(remove_nzv) #WILL REMOVE 60 VARIABLES
## [1] 60
df_train <- df_train[,-remove_nzv] #REMOVE COLUMNS WITH CLOSE TO ZERO VARIANCE
df_test <- df_test[, -remove_nzv]   #REPLICATE DATA PROCESSING IN TEST

# CLEAN PART 2: USE PROPORTION OF N/As or NULL TO FURTHER REDUCE NUMBER OF VARIABLES
NA_95pc <- sapply(df_train, function(x) mean(is.na(x))) > 0.95
summary(NA_95pc) #WILL REMOVE A FURTHER 41 VARIABLES
##    Mode   FALSE    TRUE 
## logical      59      41
df_train <- df_train[,NA_95pc == FALSE] #REMOVE COLUMNS WITH 95% NA FROM TRAIN
df_test <- df_test[,NA_95pc == FALSE] #REPLICATE DATA PROCESSING IN TEST
dim(df_train)
## [1] 19622    59
dim(df_test)
## [1] 20 59
# CLEAN PART 3: REMOVE IDENTIFICATION / TIMESTAMP COLUMNS
df_train <- df_train[, -c(1:6)]; df_train$classe <- as.factor(df_train$classe)
df_test <- df_test[,-c(1:6)]
str(df_train)
## 'data.frame':    19622 obs. of  53 variables:
##  $ roll_belt           : num  1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
##  $ pitch_belt          : num  8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0 0.02 0.02 0.02 0.02 0.02 0.02 0.03 ...
##  $ gyros_belt_y        : num  0 0 0 0 0.02 0 0 0 0 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.02 -0.02 -0.03 -0.02 -0.02 -0.02 -0.02 -0.02 0 ...
##  $ accel_belt_x        : int  -21 -22 -20 -22 -21 -21 -22 -22 -20 -21 ...
##  $ accel_belt_y        : int  4 4 5 3 2 4 3 4 2 4 ...
##  $ accel_belt_z        : int  22 22 23 21 24 21 21 21 24 22 ...
##  $ magnet_belt_x       : int  -3 -7 -2 -6 -6 0 -4 -2 1 -3 ...
##  $ magnet_belt_y       : int  599 608 600 604 600 603 599 603 602 609 ...
##  $ magnet_belt_z       : int  -313 -311 -305 -310 -302 -312 -311 -313 -312 -308 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -128 -128 -128 ...
##  $ pitch_arm           : num  22.5 22.5 22.5 22.1 22.1 22 21.9 21.8 21.7 21.6 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_x         : num  0 0.02 0.02 0.02 0 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_arm_y         : num  0 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 ...
##  $ gyros_arm_z         : num  -0.02 -0.02 -0.02 0.02 0 0 0 0 -0.02 -0.02 ...
##  $ accel_arm_x         : int  -288 -290 -289 -289 -289 -289 -289 -289 -288 -288 ...
##  $ accel_arm_y         : int  109 110 110 111 111 111 111 111 109 110 ...
##  $ accel_arm_z         : int  -123 -125 -126 -123 -123 -122 -125 -124 -122 -124 ...
##  $ magnet_arm_x        : int  -368 -369 -368 -372 -374 -369 -373 -372 -369 -376 ...
##  $ magnet_arm_y        : int  337 337 344 344 337 342 336 338 341 334 ...
##  $ magnet_arm_z        : int  516 513 513 512 506 513 509 510 518 516 ...
##  $ roll_dumbbell       : num  13.1 13.1 12.9 13.4 13.4 ...
##  $ pitch_dumbbell      : num  -70.5 -70.6 -70.3 -70.4 -70.4 ...
##  $ yaw_dumbbell        : num  -84.9 -84.7 -85.1 -84.9 -84.9 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 37 37 ...
##  $ gyros_dumbbell_x    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ gyros_dumbbell_z    : num  0 0 0 -0.02 0 0 0 0 0 0 ...
##  $ accel_dumbbell_x    : int  -234 -233 -232 -232 -233 -234 -232 -234 -232 -235 ...
##  $ accel_dumbbell_y    : int  47 47 46 48 48 48 47 46 47 48 ...
##  $ accel_dumbbell_z    : int  -271 -269 -270 -269 -270 -269 -270 -272 -269 -270 ...
##  $ magnet_dumbbell_x   : int  -559 -555 -561 -552 -554 -558 -551 -555 -549 -558 ...
##  $ magnet_dumbbell_y   : int  293 296 298 303 292 294 295 300 292 291 ...
##  $ magnet_dumbbell_z   : num  -65 -64 -63 -60 -68 -66 -70 -74 -65 -69 ...
##  $ roll_forearm        : num  28.4 28.3 28.3 28.1 28 27.9 27.9 27.8 27.7 27.7 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 ...
##  $ yaw_forearm         : num  -153 -153 -152 -152 -152 -152 -152 -152 -152 -152 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.03 0.02 0.02 0.02 0.02 0.02 0.03 0.02 ...
##  $ gyros_forearm_y     : num  0 0 -0.02 -0.02 0 -0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  -0.02 -0.02 0 0 -0.02 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  192 192 196 189 189 193 195 193 193 190 ...
##  $ accel_forearm_y     : int  203 203 204 206 206 203 205 205 204 205 ...
##  $ accel_forearm_z     : int  -215 -216 -213 -214 -214 -215 -215 -213 -214 -215 ...
##  $ magnet_forearm_x    : int  -17 -18 -18 -16 -17 -9 -18 -9 -16 -22 ...
##  $ magnet_forearm_y    : num  654 661 658 658 655 660 659 660 653 656 ...
##  $ magnet_forearm_z    : num  476 473 469 469 473 478 470 474 476 473 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(df_test)
## [1] 20 53

Using these data cleansing approaches we are left with a much more manageable dataset of 19622 rows and 53 variables. The cleaning actions have been replicated in the test dataset.


Session Info

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 7 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Ireland.1252  LC_CTYPE=English_Ireland.1252   
## [3] LC_MONETARY=English_Ireland.1252 LC_NUMERIC=C                    
## [5] LC_TIME=English_Ireland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DiagrammeR_1.0.0    xgboost_0.71.2      corrplot_0.84      
##  [4] randomForest_4.6-14 rpart.plot_3.0.6    rpart_4.1-13       
##  [7] rattle_5.2.0        caret_6.0-81        ggplot2_3.0.0.9000 
## [10] lattice_0.20-35    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18       bindr_0.1.1        knitr_1.20        
##  [4] magrittr_1.5       MASS_7.3-50        splines_3.5.1     
##  [7] hms_0.4.2          tidyselect_0.2.4   viridisLite_0.3.0 
## [10] colorspace_1.3-2   R6_2.2.2           rlang_0.3.0.1     
## [13] brew_1.0-6         foreach_1.4.4      visNetwork_2.0.5  
## [16] Rook_1.1-1         gower_0.1.2        influenceR_0.1.0  
## [19] withr_2.1.2        iterators_1.0.10   htmltools_0.3.6   
## [22] class_7.3-14       recipes_0.1.4      assertthat_0.2.0  
## [25] rprojroot_1.3-2    digest_0.6.17      tibble_1.4.2      
## [28] Matrix_1.2-14      rmarkdown_1.10     downloader_0.4    
## [31] compiler_3.5.1     pillar_1.3.0       scales_1.0.0      
## [34] backports_1.1.2    generics_0.0.2     stats4_3.5.1      
## [37] jsonlite_1.5       lubridate_1.7.4    pkgconfig_2.0.2   
## [40] igraph_1.2.2       rstudioapi_0.7     munsell_0.5.0     
## [43] prodlim_2018.04.18 plyr_1.8.4         dplyr_0.7.6       
## [46] stringr_1.3.1      tools_3.5.1        grid_3.5.1        
## [49] nnet_7.3-12        ipred_0.9-8        nlme_3.1-137      
## [52] timeDate_3043.102  data.table_1.11.8  gtable_0.2.0      
## [55] lazyeval_0.2.1     yaml_2.2.0         survival_2.42-3   
## [58] crayon_1.3.4       bindrcpp_0.2.2     gridExtra_2.3     
## [61] lava_1.6.4         tidyr_0.8.1        purrr_0.2.5       
## [64] readr_1.2.1        reshape2_1.4.3     RColorBrewer_1.1-2
## [67] viridis_0.5.1      ModelMetrics_1.2.2 codetools_0.2-15  
## [70] htmlwidgets_1.3    glue_1.3.0         evaluate_0.11     
## [73] rgexf_0.15.3       stringi_1.1.7      XML_3.98-1.16
